D984-D991 Nucleic Acids Research, 2012, Vol. 40, Database issue 
doi:10.1093/nar/gkr1051 



Published online 24 November 2011 



The Stem Cell Discovery Engine: an integrated 
repository and analysis system for cancer 
stem cell comparisons 

Shannan J. Ho Sui 1,2,*, Kimberly Begley 1,2, Dorothy Reilly 1,2,3, Brad Chapman 1,2, 
Ray McGovern 1,2, Philippe Rocca-Serra 4, Eamonn Maguire 4, Gabriel M. Altschuler 1, 
Terah A. A. Hansen 1,2, Ramakrishna Sompallae 1, Andrei Krivtsov 5,6, 
Ramesh A. Shivdasani 6,7, Scott A. Armstrong 5,6,7, Aedín C. Culhane 1,8, Mick Correll 8,9, 
Susanna-Assunta Sansone 3, Oliver Hofmann 1,2 and Winston Hide 1,2,7,* 

1 Department of Biostatistics, 2 HSPH Bioinformatics Core, Harvard School of Public Health, Boston, MA, 
3 Developmental and Molecular Pathways, Novartis Institutes for BioMedical Research, Cambridge, MA, USA, 
4 Oxford e-Research Centre, University of Oxford, UK, 5 Department of Pediatric Oncology, Children's Hospital, 
6 Dana Farber Cancer Institute, and Harvard Medical School, Boston, 7 Harvard Stem Cell Institute, Cambridge, 
8 Department of Biostatistics, Dana Farber Cancer Institute and 9 Center for Cancer Computational Biology, 
Dana Farber Cancer Institute, Boston, MA, USA 

Received August 22, 2011; Revised October 13, 2011; Accepted October 25, 2011 



ABSTRACT 

Mounting evidence suggests that malignant tumors 
are initiated and maintained by a subpopulation of 
cancerous cells with biological properties similar 
to those of normal stem cells. However, descrip- 
tions of stem-like gene and pathway signatures in 
cancers are inconsistent across experimental 
systems. Driven by a need to improve our under- 
standing of molecular processes that are common 
and unique across cancer stem cells (CSCs), we 
have developed the Stem Cell Discovery Engine 
(SCDE), an online database of curated CSC experiments 
coupled to the Galaxy analytical framework. 
The SCDE allows users to consistently describe, 
share and compare CSC data at the gene and 
pathway level. Our initial focus has been on carefully 
curating tissue and cancer stem cell-related experi- 
ments from blood, intestine and brain to create 
a high quality resource containing 53 public 
studies and 1098 assays. The experimental informa- 
tion is captured and stored in the multi-omics 
Investigation/Study/Assay (ISA-Tab) format and 
can be queried in the data repository. A linked 
Galaxy framework provides a comprehensive, 
flexible environment populated with novel tools for 
gene list comparisons against molecular signatures 
in GeneSigDB and MSigDB, curated experiments in 
the SCDE and pathways in WikiPathways. The SCDE 
is available at http://discovery.hsci.harvard.edu. 

INTRODUCTION 

Cells in adult non-germinal tissues such as blood, skin and 
intestine turn over briskly and are known to require stem 
cells for lifelong renewal. These tissue stem cells are 
capable of proliferation and self-renewal, and can 
produce differentiated progeny through the expression of 
tissue-specific genes. Recent evidence suggests that 
studying adult stem cells can provide insight into cancer 
cell biology. Only small fractions of tumor-derived cells 
are clonogenic in culture or tumorigenic in vivo (1,2). 
Cancers are therefore thought to rely on the activity of 
stem or stem-like cells that are tumorigenic and exhibit the 
cardinal properties of self-renewal and multi-lineage dif- 
ferentiation potential. 

Stem and differentiated cells within a tumor are 
reported to differ in sensitivity toward therapy (3). 
Studies have independently established embryonic stem 
cell gene expression signatures where cancer subtypes 
with poor survival prognosis are enriched in treatment- 
resistant, stem-like cells. Stem cell signatures resulting in 
poor prognosis have so far been found in glioma, breast, 
lung, colon and esophageal cancers (4-10). Comparing 
stem cell populations therefore has the potential to 
identify new molecular targets for drug and immune 
therapies that destroy the self-renewing cancer stem cells 
(CSCs). However, descriptions of gene and pathway 
stem-like signatures across cancers are inconsistent 
across platforms, tissues and laboratories. 

Driven by a need to understand CSC molecular profiles 
generated at the Harvard Stem Cell Institute (HSCI), we 
have developed a platform to integrate CSC experimental 
information: the Stem Cell Discovery Engine 
(http://discovery.hsci.harvard.edu). We have collected, curated and 
integrated this data into the Stem Cell Discovery Engine 
(SCDE) to permit molecular comparisons between normal 
and cancerous stem cells, between stem-cell compartments 
in blood, intestine and brain, and between mouse models 
and human tissues. 

SCDE overview 

The SCDE is a modular online system designed to handle 
data submission, curation, analysis, integration and dis- 
semination of stem cell-related experiments (Figure 1). 
The system has two components: (i) a tissue and cancer 
stem cell database accessible through the BioInvestigation 
Index (BII) (11) and (ii) a customized instance of the 
Galaxy analysis engine (12,13). It includes tools that inte- 
grate public stem cell data with user-submitted experi- 
ments. Its initial focus is on gene list manipulation, and 
interaction with the curated Gene Signatures Database 
(GeneSigDB) (14), Molecular Signatures Database 
(MSigDB) (15), and WikiPathways pathway database 
(16) (Figure 1). A description of the database in accordance 
with BioDBCore standards (17) is available in 
Supplementary Table S1. 

Curation of experimental metadata and derived data 

The SCDE database provides a source of structured ex- 
perimental information on assays, derived gene lists and 
pathway profiles. Heterogeneity in experimental informa- 
tion has been reduced by rigorous, manual curation of the 
experimental model, cell and tissue types, disease state, 
surface markers and other relevant data. Submitted user 
data is first checked for relevance, i.e. studies must be 
performed using well-defined stem cell, tissue stem cell 
and/or cancer stem cell populations, and must produce 
genome-scale data with potential to provide insight into 
the stem-like characteristics of cancers. All of the raw data 
with its sample characteristics must be available. Data 
input fields are then mapped to the ontologies listed in 
Table 1 according to species-specificity and overall 
coverage of the ontology. New terms are submitted to 
the ontology maintainers for future inclusion. This 
ensures that new terms are standardized and incorporated 
for community use. Experimental protocols and analytical 
methods are annotated with the goal of providing suffi- 
cient information to reproduce or perform similar experi- 
ments and to derive the processed data. Derived data in 
the form of gene lists are converted to standardized iden- 
tifiers to be used for gene list comparisons within Galaxy. 
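
The identifier conversion step described above can be illustrated with a minimal sketch. It assumes a small, hard-coded alias table (ALIAS_TO_SYMBOL) and a helper named standardize_gene_list; both are hypothetical and are not part of the SCDE or GeneSigDB code. Upper-casing is a simplification suited to human symbols only; mouse symbols follow a different capitalization convention.

```python
# Hypothetical alias table for illustration; a real pipeline would load an
# official nomenclature resource instead of hard-coding entries.
ALIAS_TO_SYMBOL = {
    "CDX3": "CDX2",
    "HNF4": "HNF4A",
    "ABC1": "ABCA1",
}


def standardize_gene_list(raw_ids, alias_to_symbol=ALIAS_TO_SYMBOL):
    """Return a sorted, de-duplicated list of standardized symbols.

    Identifiers are trimmed and upper-cased, then looked up in the alias
    table; identifiers without an alias entry are kept unchanged.
    """
    standardized = set()
    for raw in raw_ids:
        key = raw.strip().upper()
        if not key:
            continue  # skip blank entries
        standardized.add(alias_to_symbol.get(key, key))
    return sorted(standardized)


# Two spellings of the same gene collapse to a single symbol.
print(standardize_gene_list([" cdx3", "CDX2", "abc1", ""]))
```

Once every list uses the same symbols, set operations on the lists become meaningful, which is the precondition for the comparisons performed within Galaxy.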



We store experimental metadata in the Investigation/ 
Study/Assay (ISA-Tab) format, i.e. high-level information 
about the experiment is recorded in the 'investigation' file, 
sample attributes and factors in the 'study' files, and 
protocols and analysis methods in the 'assay' files. 
This general purpose tab-delimited grammar manages 
metadata from diverse studies, and enables users to align 
with community-defined minimum information, onto- 
logies and checklists (11,18,19). It comes with support 
tools for curation (including semi-automated annotation 
tagging through the NCBO BioPortal annotation service 
(20) to speed the process) and format conversion 
(http://isatab.sourceforge.net) to make it straightforward to 
submit data to international public repositories, such as 
the Gene Expression Omnibus (GEO) (21). ISA-Tab is 
supported and maintained by a global collaboration of 
biocurators (22). While the initial cost of curation is 
high, it allows for sharing of ISA-Tab configurations we 
have developed specifically for stem cell data that can be 
used within the various ISA tools by the stem cell com- 
munity. The goal is to build a curation network and es- 
tablish community involvement so that standards are 
agreed upon and adopted. 
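
Because ISA-Tab files are tab-delimited tables, they can be read with standard tooling. The following sketch parses a small study table and collects the distinct values of one annotated characteristic. The table content and the helper names (read_study_table, characteristic_values) are made up for illustration and are not taken from an SCDE submission or from the ISA tools.

```python
import csv
import io

# Illustrative study-table content (tab-delimited); all values are made up.
STUDY_TEXT = (
    "Source Name\tCharacteristics[organism]\t"
    "Characteristics[cell type]\tSample Name\n"
    "source1\tMus musculus\thematopoietic stem cell\tsample1\n"
    "source2\tMus musculus\tgranulocyte macrophage progenitor\tsample2\n"
)


def read_study_table(handle):
    """Parse a tab-delimited study table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def characteristic_values(rows, name):
    """Collect the distinct, non-empty values of one Characteristics column."""
    column = f"Characteristics[{name}]"
    return sorted({row[column] for row in rows if row.get(column)})


rows = read_study_table(io.StringIO(STUDY_TEXT))
print(characteristic_values(rows, "organism"))
print(characteristic_values(rows, "cell type"))
```

Keeping sample attributes in named columns is what allows a curator to check each column against a single ontology.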

Database contents 

Study selection focused on normal stem cells and CSCs, 
particularly in three model systems: blood, intestine and 
brain. In these tissues, the behavior of native stem cells is 
especially well characterized, investigators generally agree 
on stem cell definitions, and cancer is common. Table 2 shows 
the distribution of data across organisms, tissues and types 
of measurements. The database integrates 53 public studies 
comprising 1098 molecular assays from CSC-related 
experiments across multiple tissues, species and heterogeneous 
platforms. Five additional studies comprising 84 assays are 
stored as private, unpublished data that are available to 
specific researchers upon login and are ready for 
dissemination upon publication. Fifteen studies were 
contributed by researchers in the HSCI community and an 
additional 40 studies related to CSC biology were selected 
from StemBase (23,24). Forty-six studies were performed in 
rodent models and 13 in human cells; these include two studies 
containing samples assayed from both rodent and human models. 
The database consists largely of microarray expression 
profiling studies, but results from nucleotide sequencing 
(i.e. ChIP-seq) studies of histone methylation and 
transcription factor binding, histology and expression 
analysis by RT-PCR are also included. 

Data acquisition and dissemination 

Researchers can submit their own data or suggest public 
data to a curator, who manually curates it according to 
community-accepted standards and ontologies (Table 1). 
In cases where published studies have associated data de- 
posited in ArrayExpress, the MAGEtoISA converter tool 
permits rapid conversion from MAGE-TAB to ISA-Tab 
format, which is then manually evaluated by a curator for 
completeness and corrected where necessary. 
Figure 1. System architecture diagram showing integration of data into the SCDE BioInvestigation Index (BII) and Galaxy instances. CSC-related 
experiments are submitted by stem cell researchers or selected from public repositories. After curation using the ISA tools and conversion to ISA-Tab 
format, the associated metadata, raw data files and processed gene lists are stored in the BII. The stem cell-specific gene lists are transformed into 
standardized gene identifiers to facilitate integration and comparison against similarly formatted reference lists (GeneSigDB, MSigDB, WikiPathways 
and other SCDE experiments) within Galaxy. 



Table 1. Curated metadata 

Field                          Ontologies (in order of preference) 

Organism                       NEWT UniProt Taxonomy Database (Newt), NCBI Taxonomy (NCBITaxon) 
Strain                         Experimental Factor Ontology (EFO) 
Developmental stage            EFO 
Disease state                  ICD-9, NCI Thesaurus, Disease Ontology 
Organism part (tissue type)    Foundational Model of Anatomy, Mouse Gross Anatomy, BRENDA tissue/enzyme source (BTO), EFO 
Cell type                      Cell Type Ontology (CL), EFO 
Cell line                      EFO, NCI Thesaurus 
Genotype                       Ontology for Biomedical Investigations (OBI; depending on species) 
Cell surface marker            Currently annotated as ±, high/lo. An appropriate standard needs to be developed. 
Immunoprecipitation antibody   Protein Name (specify manufacturer where available) 
Binding site                   SO (sequence ontology) for methylation sites 
Phenotypic quality             PATO: Phenotypic qualities (properties) 
Treatment (perturbation)       PATO, CHEBI: Chemical Entities of Biological Interest, OBI (to describe perturbations such as 
                               genetic modification, transient expression) 

Table 2. SCDE data 

                                                   Studies a      Assays a 

Organism 
  Mouse                                            45 (4)         [illegible] 
  Human                                            [illegible]    [illegible] 
  Rat                                              1              18 
Tissue type 
  Blood/bone marrow                                [illegible]    [illegible] 
  Muscle                                           8              125 
  Brain/neural                                     6 (1)          68 (6) 
  Intestine                                        4              39 
  Mammary                                          2              34 
  Skin                                             2 (1)          135 (30) 
Measurement b 
  Transcription profiling                          57 (5)         1161 (84) 
  Histone modification profiling                   2              42 
  Transcription factor binding site identification 1              21 
  Tissue histology                                 1 (1)          6 (6) 

a Total number of studies and assays; the number of private studies and 
assays is shown in brackets. 

b Further information and details of technology platforms are available 
online at http://discovery.hsci.harvard.edu. 



To ensure that all stem cell data are comparable, primary 
and derived data sets are organized in a standardized 
manner and disseminated to the public using a local 
instance of the SCDE BioInvestigation Index (BII). This 
data repository is designed to support storage, querying 
and display of multi-omics data sets (11). The annotated 
metadata allows users to search the entire corpus of ex- 
periments in the BII based on organism, measurement 
type (e.g. transcriptional profiling), technology (e.g. nu- 
cleotide sequencing), and platform (e.g. Illumina) or to 
search free text across all fields (Figure 2A). Study pages 
display the details of each experiment (Figure 2B-D). The 
annotation has focused on ensuring that cell types, tissues 
and experimental variables are consistently reported to 
improve query capabilities, and to establish sound anno- 
tation practices to describe stem cell research (e.g. descrip- 
tions of genetic modifications). 
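
The two query modes described above, filtering on annotated fields and free-text search across all fields, can be sketched as follows. The StudyRecord class, the search function and the three records are hypothetical illustrations; they do not reproduce the BII data model, its query interface or any SCDE entry.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StudyRecord:
    """Hypothetical, simplified view of one annotated study."""
    accession: str
    organism: str
    measurement: str
    technology: str
    platform: str


# Made-up records for illustration only; they are not SCDE entries.
RECORDS = [
    StudyRecord("DEMO-S-1", "Mus musculus", "transcription profiling",
                "DNA microarray", "Affymetrix"),
    StudyRecord("DEMO-S-2", "Homo sapiens", "histone modification profiling",
                "nucleotide sequencing", "Illumina"),
    StudyRecord("DEMO-S-3", "Homo sapiens", "transcription profiling",
                "DNA microarray", "Illumina"),
]


def search(records, text=None, **filters):
    """Return records that satisfy every field filter and the free-text term."""
    matches = []
    for record in records:
        fields = asdict(record)
        # Field filters: case-insensitive equality on the named field.
        if any(fields[name].lower() != wanted.lower()
               for name, wanted in filters.items()):
            continue
        # Free text: case-insensitive substring match across all fields.
        if text and not any(text.lower() in value.lower()
                            for value in fields.values()):
            continue
        matches.append(record)
    return matches


human_illumina = search(RECORDS, organism="homo sapiens", platform="illumina")
print([r.accession for r in human_illumina])
sequencing = search(RECORDS, text="sequencing")
print([r.accession for r in sequencing])
```

Field filters only return useful results when the underlying values are consistent, which is why the annotation effort concentrates on controlled terms for cell types, tissues and experimental variables.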

Published studies are automatically made publicly avail- 
able. ISA-Tab formatted metadata can be downloaded 
for information pertaining to the assays, such as normal- 
ization procedures for microarray experiments and GEO 
accession identifiers where available. Raw primary data 
(e.g. CEL files for Affymetrix microarrays) and processed 
derived data (e.g. author-generated gene lists) can also be 
downloaded from the BII using the 'Raw Data' and 
'Processed Data' buttons (Figure 2E). Alternatively, the 
data can be accessed within the SCDE Galaxy framework 
for analysis as described in the following section. 
Researchers with the appropriate access permissions can 
query unpublished data to perform early analyses, and 
upon publication, have the added benefit of exporting 
their ISA-Tab formatted data for submission to 
ArrayExpress using the conversion tools. The correspond- 
ing functionality for submission to GEO in MiniML 
format is in progress and will represent a valuable incen- 
tive for the stem cell community to use the SCDE as a first 
port of call for submission of CSC functional genomics 
data. 



Querying CSC molecular signatures using Galaxy 

In addition to querying experimental metadata, the SCDE 
provides functionality to interrogate stem cell molecular 
profiles in a linked Galaxy instance, with the goal of iden- 
tifying similarities and differences between normal and 
cancer stem cell experiments. All raw and processed data 
stored in the BII and several additional manually curated 
stem cell-related gene lists are accessible from within 
Galaxy for analysis. 

Manual curation and consistent identifier conver- 
sion differentiate the SCDE from other gene list compari- 
son tools. Derived gene lists have been mapped to 
standardized gene symbols using methods developed for 
GeneSigDB. Such standardization allows for comparisons 
to determine genes that are shared or unique across 
experiments. Tools are available to compare a single 
gene list (SCDE ListMatch) or multiple gene lists 
(SCDE ListCompare) against curated gene signatures in 
GeneSigDB, molecular signatures in MSigDB, derived 
gene lists from the SCDE database and pathways 
in WikiPathways. These tools allow users to identify 
genes in common with defined reference signatures 
and pathways (Figure 3). Results are summarized and 
ranked according to a hypergeometric test P-value and 
linked to the relevant overlapping gene sets (Figure 3B). 
For WikiPathways comparisons, a link is provided to 
visualize the gene matches in color-coded diagrams of ca- 
nonical pathways (Figure 3C). The SCDE Intersect tool 
identifies genes that are common to multiple gene lists. By 
using the Galaxy interface, users can maintain a record of 
their analysis history and easily compare multiple data sets 
stored in their history. 
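
A minimal sketch of the ranking idea follows. It assumes the P-value is the upper tail of the hypergeometric distribution, i.e. the probability of observing at least the given overlap by chance, and it treats the background size as a user-supplied parameter; the exact test configuration used by the SCDE tools is not specified here. The function names and gene labels are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(overlap, query_size, signature_size, background_size):
    """P(X >= overlap) when query_size genes are drawn without replacement
    from background_size genes, signature_size of which are in the signature."""
    max_overlap = min(query_size, signature_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(signature_size, i)
        * comb(background_size - signature_size, query_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


def rank_signatures(query, signatures, background_size):
    """Rank reference signatures by overlap P-value, smallest first."""
    query = set(query)
    results = []
    for name, genes in signatures.items():
        genes = set(genes)
        shared = query & genes
        if not shared:
            continue  # report only signatures with at least one match
        p_value = hypergeom_upper_tail(
            len(shared), len(query), len(genes), background_size
        )
        results.append((name, sorted(shared), p_value))
    return sorted(results, key=lambda item: item[2])


def intersect(*gene_lists):
    """Genes common to every supplied list."""
    return sorted(set.intersection(*(set(genes) for genes in gene_lists)))


# Toy example with placeholder gene labels and an assumed background of 100.
signatures = {"sig_small": ["G1", "G2", "G3"], "sig_large": ["G1"] + [f"X{i}" for i in range(40)]}
ranked = rank_signatures(["G1", "G2", "G9"], signatures, background_size=100)
print([name for name, _, _ in ranked])
print(intersect(["G1", "G2", "G9"], ["G2", "G1"], ["G1", "G5"]))
```

The choice of background size strongly affects the resulting P-values, so rankings are best compared only between runs that use the same background.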

DISCUSSION 

The SCDE database provides a repository for curated 
CSC data and a framework for developing methods to 
compare molecular information on stem cell related popu- 
lations. We illustrate the functionality of the SCDE using 
the following use case as an example. A leukemia research- 
er enters the SCDE through the BII interface. A search for 
the term 'leukemia' in the free text search box produces 
five transcriptional profiling studies performed in mouse 
models. The user selects the first result (ARMSTRONG-S-1) 
to obtain further details of the study and is provided 
with information about genetic modifications, hematopoi- 
etic progenitor cell types, immunophenotypes, type 
of leukemia studied and the mouse strain used in the ex- 
periment. Wishing to perform a related experiment, the 
researcher downloads the experimental metadata in 
ISA-Tab format, which provides him with additional in- 
formation about the sample cell types, labeling protocol, 
microarray chip used, number of replicates, normalization 
procedure, etc. After performing his experiment, the re- 
searcher returns to the SCDE to determine how similar his 
results are to the ARMSTRONG-S-1 study, or indeed to 
any of the experiments in the SCDE. Using the Galaxy 
web interface, he uploads his list of differentially expressed 
genes from his leukemia experiment and uses the 
ListMatch tool to determine the following: (i) significant 

Figure 2. Screenshots showing elements of the BioInvestigation Index browse view. (A) The results of a free text search using the term 'intestine' that 
retrieves four studies (two human and two murine) that include transcription profiling using DNA microarrays, and ChIP-seq for transcription 
factor binding and histone modifications; (B) One matching record, SHIVDASANI-S-1, with descriptive text; (C) Annotated experimental factor 
metadata including time, antibodies used in the chromatin immunoprecipitation (ChIP) experiments, and cell differentiation status; (D) Annotated 
sample metadata including the tissue and cell types, using the Foundational Model of Anatomy (FMA) and Brenda Tissue (BTO) ontologies, 
respectively; (E) Download panel for the raw and processed data. 

Figure 3. Composite figure showing the results of a ListMatch query using the set of intestinal differentiation genes that are reduced upon Cdx2 
depletion from the SHIVDASANI-S-2 study. (A) SCDE ListMatch input page with options to compare against WikiPathways, GeneSigDB, 
MSigDB and the SCDE repository. (B) Results of the query against WikiPathways projected onto the canonical pathway representation with 
matching genes highlighted in red (partial screenshot shown). (C) Querying against GeneSigDB results in a top match to genes related to liver cancer. 


overlap with gene signatures from SCDE experiments (this 
may reveal similarities to the leukemia studies or other 
hematopoietic stem cell experiments contained in the 
SCDE); (ii) genes enriched in curated signatures from 
GeneSigDB or MSigDB (such overlaps provide information 
about similar disease states, positional biases and 
functional groupings) and (iii) genes that overlap with 
known pathways from WikiPathways (genes are projected 
onto the canonical pathway diagram to indicate where 
they occur within the pathway). Going a step further, 
the researcher uses the ListCompare tool to find the 
overlap with the ARMSTRONG-S-1 gene list with 



reference to canonical pathways in WikiPathways. This 
allows him to identify pathways that contain genes from 
both lists even where the intersection of the two lists is 
small, generating hypotheses about possible pathways to 
study further. Having done his analysis within Galaxy, the 
researcher saves the gene lists, parameters and results and 
can share this data with his collaborators or make it 
publicly available. 
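
The pathway-level comparison in this use case, finding pathways that contain genes from both lists even when the lists themselves barely intersect, can be sketched as below. The function name, the gene labels and the pathway names are placeholders and do not reflect the ListCompare implementation or real WikiPathways content.

```python
def pathways_hit_by_both(list_a, list_b, pathways):
    """Map each pathway hit by both lists to the genes each list contributes."""
    genes_a, genes_b = set(list_a), set(list_b)
    hits = {}
    for name, members in pathways.items():
        members = set(members)
        in_a, in_b = genes_a & members, genes_b & members
        if in_a and in_b:
            hits[name] = (sorted(in_a), sorted(in_b))
    return hits


# Placeholder data: the two lists share no gene, yet both touch pathway_1.
pathways = {
    "pathway_1": ["G1", "G2", "G3", "G4"],
    "pathway_2": ["G5", "G6"],
}
user_list = ["G1", "G5"]
reference_list = ["G3", "G7"]

print(set(user_list) & set(reference_list))
print(pathways_hit_by_both(user_list, reference_list, pathways))
```

In this toy case the direct intersection is empty, but pathway_1 is reported because each list contributes a different member, which is the kind of result that can suggest a pathway for further study.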

The SCDE is unique in its community-oriented 
approach for identifying relevant experiments, capturing 
and curating study information, and integrating new 
analysis capabilities compared to previous resources. The 
adoption of the ISA-Tab format permits inclusion of 
multiple diverse data types and demonstrates that the 
tools we have used are ready for scale up. The Galaxy 
framework allows us to rapidly add relevant analysis 
methods developed by the growing Galaxy development 
community in which we are active participants. The im- 
plementation of open source software projects that are 
gaining community support will ensure that the SCDE 
continues to evolve. The tools developed for the SCDE 
Galaxy instance have been published on Bitbucket at 
http://bitbucket.org/hbc/galaxy-central-hbc. By 
publishing this resource and making the infrastructure 
available, we hope to develop the stem cell community 
and obtain feedback on annotation practices, relevant 
data sets and analytical methods. 

Future directions 

While the comparison of gene signatures is informative, a 
systematic approach to compare and determine the role of 
key pathway contributions across different experimental 
systems and cancers against a consistent background is 
needed. A pathway fingerprinting method to determine 
functional similarity among experiments independently 
of platform or species is being developed for integration 
into the SCDE (Altschuler et al., submitted). We will 
continue to expand the SCDE to include additional 
CSC-related studies and new data types, and work with 
the stem cell community to further refine relevant ontology 
terms, as has been the case for the Cell Ontology (25). A 
further focus will be to develop methods to integrate epi- 
genetic data with gene expression. Scientists interested in 
adding or curating studies, or in implementing analysis 
options that are not yet available, are encouraged to 
contact us at scde@hsci.harvard.edu. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 